Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping

Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.


Introduction
Electronic Health Records (EHRs) provide a large and rich data source for biomedical research aiming to further our understanding of disease progression and treatment response.EHR data has been successfully used to gain novel insights into a wide range of diseases, with examples including diabetes (Brownstein et al., 2010), rheumatoid arthritis (Liao et al., 2014), inflammatory bowl disease (Ananthakrishnan et al., 2014), and autism (Doshi-Velez et al., 2014).EHR is also a powerful discovery tool for identifying novel associations between genomic markers and multiple phenotypes through analyses such as phenome-wide association studies (Denny et al., 2010;Kohane, 2011;Wilke et al., 2011;Cai et al., 2018).
estimator through regularization with penalty functions reflecting the prior knowledge.When the prior assumption holds exactly, we show that the unlabeled data can naturally be used to assist in the estimation of the phenotype model.Allowing the prior assumption to hold approximately or to be highly violated, our prior adaptive semi-supervised (PASS) estimator adaptively incorporate the prior knowledge by shrinking the estimator towards a direction derived under the prior.
The proposed PASS estimator is similar to the prior LASSO (pLASSO) procedure of Jiang et al. (2016) in spirit in that both approaches aim to incorporate prior information into the 1 penalized estimator in a high-dimensional setting.The differences are, nevertheless, substantial and clear.Jiang et al. (2016) assumed that the prior information was summarized into prediction values and contributed to the likelihood term.In contrast, we use prior information to guide the shrinkage and put them into the penalty term.In this sense, PASS and pLASSO complement each other to some extent.However, as shown in both theory and simulations, putting prior information into the likelihood term tends to lead to the "take it or leave it" phenomenon: the usefulness of the prior information is determined based on the overall effect of all predictors.On the other hand, by putting prior information into the penalty term, the PASS approach provides more flexible control: the data is able to scrutinize the individual effect of each predictor.This gained flexibility can result in improved theoretical and numerical performances.
The rest of this paper is organized as follows.We discuss the motivation, an important special scenario and the general methodology in Section 2. We analyze the theoretical properties of the proposed approach in Section 3, and access its finite sample performance via simulation studies in Section 4. Furthermore, we illustrate the practical value of the proposed approach on three real EHR datasets in Section 5. Finally, we conclude this paper with some discussions and extensions in Section 6.All technical proofs and additional numerical results are given in the Supplementary Materials.

Setup
We assume that the underlying data consists of N independent and identically distributed (i.i.d.) observations {(Y i , S i , X i ) = (Y i , W i ) , i = 1, . . ., N }, where Y i is a binary indicator of the disease status of the ith patient, S i is a scalar surrogate variable that is reasonably predictive of Y i chosen via domain knowledge, and X i is a p-dimensional feature vector.Examples of S i includes the total count of ICD codes or mentions for the disease of interest in clinical notes extracted via natural language processing (NLP).Candidate features X may include the ICD9 code counts for competing diagnosis, lab results, as well as NLP mentions of relevant signs/symptoms, medications and procedures.We may also include various transformations of original features in X to account for non-linear effects.While {W i , i = 1, ..., N } is fully observed, Y i is only observed for a random subset of n patients.Hence the observed data are L ∪ U , where without loss of generality, the first n observations are assumed fully observed as L = {(Y i , W i ) , i = 1, . . ., n}, and the rest constitute the unlabeled data as U = {W i , i = n + 1, . . ., N }.
Throughout, for a d-dimensional vector u, the q -norm of . ., p}, then v J denotes a d-dimensional vector whose jth element is v j 1 j∈J , and 1 B is the indicator function for set B. The independence between random variables/vectors U and V is written as U ⊥ ⊥ V .We also denote the negative log-likelihood function associated with the logistic model with (y, η) = −yη + log(1 + e η ).

Model Assumptions
To predict Y using W = (S, X ) , we assume where for any vector w, w = (1, w ) , and π(t) = e t /(1 + e t ).To leverage the data in U , we further assume a single index model (SIM) for S | X, i.e. there exists α 0 ∈ R p such that where X α 0 is a single linear combination of the features X and f is an unknown link function.
Here ζ 0 , γ 0 , β 0 and α 0 are parameters to be estimated where only the direction of α 0 is identifiable and its norm does not affect our construction introduced below.If α 0 and β 0 are similar in certain ways, one would expect that the unlabeled data U may be used to improve upon the standard supervised estimator for β 0 using L alone.For example, if S is a noisy representation of Y with random measurement error, then it is reasonable and common in the EHR literature (Hong et al., 2019;Zhang et al., 2020, e.g.) to assume Under (C prior ), we have Proposition 1 with proof given in Supplementary Materials.
Proposition 1.Under (M Y ), (M S ), (C prior ), and assume E(XX ) is positive-definite, and it holds that: (C1) for any two vectors Remark 1. Condition (C1) holds for elliptical distributions including multivariate normal.By Diaconis and Freedman (1984) and Hall and Li (1993), this assumption tends to hold for non-elliptical design when the dimensionality is high.Specifically, one can show that under mild regularity conditions, for two projection vectors a 1 and a 2 uniformly randomly drawn from S p−1 = {v ∈ R p−1 : v 2 = 1}, the pair (X a 2 , X a 1 ) weakly converges to a bivariate normal distribution with high probability, and thus E(X a 2 | X a 1 ) is at least approximately linear in X a 1 ; see Theorem 1.1 of Diaconis and Freedman (1984) and equation (1.9) of Hall and Li (1993).
Proposition 1 hinges on the main result of Li and Duan (1989) that when the features X satisfy (C1), the direction of the coefficients of a SIM could be estimated using least squares regression for the response against X.It suggests that U can greatly improve the estimation of β 0 under (C prior ) because the phenotype model (M Y ) may be rewritten as logit Pr(Y = 1|W) = ζ + Sγ + ρX α for some ρ.Under this model, a simple SS estimator for ζ, γ and β in (M Y ) can be obtained as ζ, γ and ρ α, where By doing so, the direction of the high dimensional vector β is estimated based on the entire data L ∪ U , and only the parameters (ζ, γ, ρ) are estimated using the small labeled data L .Hereafter we shall refer to this SS estimator derived under (C prior ) as SS prior .
Nevertheless, SS prior is only valid when (C prior ) and (C1) holds exactly.Our goal is to develop a more robust SS estimator under (M Y ) and (M S ) that can efficiently exploit U when (C prior ) and (C1) may only hold approximately.In this more general setting, a desirable SS estimator should improve upon the standard supervised estimator when the directions of α 0 and β 0 are similar in their magnitude and/or support.In addition, it should perform similarly to the supervised estimator when the two directions are not close.We shall now detail our PASS estimation procedure which automatically adapts to different cases as reflected in the observed data.

Prior Adaptive Semi-Supervised (PASS) Estimator
With L only, a supervised estimator for β can be obtained via the standard 1 -penalized regression: (1) With properly chosen λ, the consistency and rate of convergence for θ has been established (van de Geer, 2008).To improve the estimation of β through leveraging U , we note that when (C prior ) holds approximately, the magnitude of β 0 − ρα 0 is small for some ρ, and the support of β 0 − ρα 0 is of small size as well.
To incorporate such prior belief on the relationship between α 0 and β 0 , we construct the penalty term min where A 0 = supp(α 0 ), and λ 1 , λ 2 > 0 are tuning parameters.Since (α 0 ) A c 0 = 0, the penalty term is equivalent to The first term in the penalty measures how far β is from the closest vector along the α 0 direction, and hence encourages smaller magnitude of β − ρα 0 .The second term shrinks β A c 0 towards 0, which reflects our prior that predictors irrelevant to S are likely to be irrelevant to Y as well.The tuning parameters λ 1 , λ 2 control the strength of the belief imposed.When they are sufficiently large, β will be forced to be a multiple of α 0 and thus it ends up with the same estimator as in the case where (C prior ) holds.
Appending the penalty term (2) to the likelihood and replacing α 0 with its estimate α, we propose to estimate ϑ 0 = (ζ 0 , γ 0 , β 0 ) by where A = supp( α).The estimators can be equivalently obtained as The impact of the tuning parameters λ 1 , λ 2 can be understood from a bias-variance tradeoff viewpoint.When λ j 's are large, β tends to be a multiple of α and thus is an estimator with high bias and low variance.In contrast, when λ j 's are small, the likelihood term based on the labeled data L is the dominant part, and hence β will have low bias and high variance.By varying the values of λ j 's, we are able to obtain a continuum connecting these two extremes.In practice, λ 1 and λ 2 can be chosen via standard data-driven approaches such as the cross-validation.

Theoretical Properties
In this section, we present non-asymptotic risk bounds for the PASS estimator.We also make theoretical comparisons with the supervised LASSO estimator to shed light on when PASS outperforms the LASSO and where such improvement comes from.

Main result
We first establish the risk bounds for the PASS estimator in the following theorem.Its proof can be found in Section S2 of the Supplementary Materials.
Remark 2. As detailed in Section S2.1 of the Supplement Material, Assumptions (A1)-(A8) are imposed on tail behaviour of the regression residuals, regularity of the design matrix, minimum signal strength of α * , sample sizes and rates of the tuning parameters.These assumptions are commonly used conditions in the theoretical literature of LASSO, such as the sub-Gaussian variable condition and the restricted eigenvalue condition; see e.g.van de Geer and Bühlmann (2009); Bickel et al. (2009); Bühlmann and Van De Geer (2011).
).In general, as long as N max(n, p) and α * is not much denser than β 0 as in the typical EHR application cases, O(λ 1 ∆ α |ρ * |) is dominated by the risk of the supervised LASSO estimator and even the supervised oracle estimator obtained under the knowledge of supp(β 0 ).
To gain a better understanding of how the key quantity Ξ in Theorem 1 changes with respect to the similarity between the prior information α * and the target β 0 , we shall discuss several specific cases in Section 3.3, based on the risk bound derived in Theorem 1.

Specific Cases
Following Remark 3, we focus our discussions on the settings where N is sufficiently large such that the last term of the risk bound is negligible.We consider three different scenarios as illustrated in Figure 1: (Case 1) α * recovers both the support and direction of β 0 ; (Case 2) α * almost recovers the support of β 0 but has a substantially different direction from β 0 ; (Case 3) α * fails to recover the support of β 0 (let alone its direction) and provides poor information.These three cases depict perfect, good, and poor qualities of the prior information α * in recovering the support and direction of β 0 .Next, we rigorously characterize the three cases by properly specifying the parameters ρ, δ, S + , and S -, and derive the convergence rate of Ξ, the risk bound of the PASS estimator, based on Theorem 1.
, S + = P and S -= supp(δ 0 ).If α * successfully recover the support and direction of β 0 (see the left panel in Figure 1), S -≈ ∅ and δ 0 1 ≈ 0. Since G 1/2 (θ − θ 0 ) 2 = 0 and S + ∩ A * = ∅, we have Ξ = O{E (θ, S + , S -)} by the definition of θ * .Hence by Theorem 1, the excess risk of θ As a standard result (Negahban et al., 2009), the rate of the excess risk of the supervised LASSO estimator is either O p {n −1 log(p)|B 0 |} or O{n −1/2 log(p) 1/2 β 0 1 }.These two rate bounds are established under different sparsity norms of β 0 , and generally comparable, e.g. when order of average magnitude of the non-zero entries in β 0 is n −1/2 log(p) 1/2 .In comparison with them, O p (n −1 ), the risk rate of PASS in Case 1, is much more refined.Further, O p (n −1 ) is actually the rate of the estimator of a low (fixed) dimensional logistic regression.Thus, if β 0 is very close to a multiple of α * , PASS could outperform the vanilla LASSO and be comparable with a low dimensional regression in terms of the convergence rate.This big gain is owing to the use of N unlabeled data to obtain the direction of β 0 , and thus reduce the high dimensional regression to a low dimensional one where only the intercept and the scalar of β 0 need to be estimated.
Case 2. Consider the same choice of θ, S + and S -as in Case 1.If α * recovers the support but not the direction of β 0 (see the middle of Figure 1), we will only have δ S-∩A * c 1 ≈ 0 but not δ 0 1 ≈ 0. Then by Theorem 1, the excess risk of PASS is In Case 2, the convergence rate of the excess risk of PASS is still better than that of the supervised LASSO estimator when q * p: . Namely, if α * might not recover the direction of β 0 very well but the prior information A * = supp(α * ) is sparse and covers supp(β 0 ) successfully, which is reflected as S + = P, the PASS estimator still benefits from the prior information.This is because recovering the support of β 0 reduces the dimensionality of the empirical errors needed to be controlled from p to q * = |A * |.In this case, it is also interesting to compare the proposed PASS estimator with the prior LASSO (pLASSO) procedure of Jiang et al. (2016).When supp(α * ) and supp(β 0 ) are close but the directions of α * and β 0 are quite different, the pLASSO procedure is unable to utilize this information and will only result in the same convergence rate as supervised LASSO, as shown to be essentially slower than that of PASS.
In Case 3, the excess risk of the PASS is of the same order as that of supervised LASSO.Therefore the PASS approach is robust against low-quality prior information that recovers neither the direction nor the support of β 0 .This benefit is a result of using a data-adaptive parameter ρ to control the influence of the prior information on the estimator.

Main setups
We conducted extensive simulation studies to examine the finite-sample performance of the PASS estimator and compare to existing estimators.We first consider the case where the logistic model for Y | S, X is correctly specified, S | X follows an SIM, and X is near elliptical, but the similarity between α 0 and β 0 varies.Since EHR features are often zero inflated and skewed count variables, we generated X 500×1 from where [u] denotes the integer nearest to u, Σ Z = (σ i,j ) p i,j=1 and σ i,j = 4(0.5)|i−j| .Here [e Z ij ] mimics the skewed raw EHR feature, which is typically transformed via t → log(1 + t) prior to model fitting.We then generated the surrogate S from a SIM of X: To mimic different qualities of the prior information one could encounter in practice, we design six scenarios with different similarities between the true β 0 and α 0 : Our specifications of β 0 and α 0 are motivated by the three key specific cases introduced in Section 3.3 and illustrated in Figure 1.Scenario I is the ideal case where β 0 and α 0 have identical direction.
In Scenario II, most of the components of β 0 differ slightly from a scalar multiple of α 0 , while a few components differs substantially.Scenarios I and II are designed to examine the performance of PASS estimator when the prior information is highly or somewhat reliable.In Scenario III, α 0 is denser than β 0 and contains quite a few weak signals.On the contrary, in Scenario IV β 0 is denser than α 0 .In Scenario V, the magnitude of α 0 and β 0 are quite different, whereas they still share the same support.Scenarios III, IV and V are designed to examine the performance of PASS estimator with respect to different degree of accuracy of the support information.In Scenario VI, both the magnitude and the support of α 0 and β 0 differs substantially, which means the unlabeled data provides little information.This scenario allows us to see whether the PASS estimator is robust against unreliable prior information.See Figure S1 in the Supplementary Materials for a visualization of β 0 and ρα 0 across different scenarios.
We compare PASS to following existing methods: (1) supervised LASSO penalized logistic regression with n training samples (LASSO n ); (2) supervised ALASSO penalized logistic regression with n training samples, denoted by ALASSO n ; (3) the SS prior estimator as described in section 2.2; and (4) two variants of pLASSO estimators as proposed in Jiang et al. (2016).For pLASSO, we fit a penalized logistic model with an LASSO penalty imposed on predictors outside supp( α), as in equation ( 8) of Jiang et al. (2016), and then use the predicted probability from that model as Y p i in equation ( 7) of Jiang et al. (2016), denoted by pLASSO 1 ; (2) use the predicted probability given by the SS prior approach as Y p i in equation ( 7) of their paper, denoted by pLASSO 2 .Throughout, we let N = 10000 and let ν = 1 in the ALASSO weights.We use Bayesian information criterion (BIC) to select µ init and µ in the estimation of α due to large N , and use 10fold cross-validation to select λ 1 , λ 2 for the estimation of β, so that the phenotype model is tuned towards prediction performance.We quantify the average prediction performance of the estimated linear score, ϑ W, with ϑ obtained via different methods in an independent test data with size 10000.For each choice of ϑ W, we consider the area under the receiver operating characteristic curve (AUC) for classifying Y , the excess risk (ER) as defined in Section 3, and the mean squared error of the predicted probabilities (MSE-P) which is the mean squared differences between the predicted probability and the true probability.We summarize results based on 1000 simulated datasets for each configuration.
In Figures 2, we compare prediction measures for estimators obtained with n = 100.In Scenario I where the directions of β 0 and α 0 coincide, the SS prior approach performs the best as expected, yet the proposed PASS method attained very similar accuracy followed by pLASSO 2 which performed only slightly worse.When the directions of β 0 and α 0 are somewhat different as in Scenario II, the SS prior and the pLASSO estimators deteriorate quickly.In contrast, the PASS estimator maintains high accuracy and outperforms all competing estimators substantially.We observe qualitatively similar patterns for Scenarios III and IV under which α 0 and β 0 have somewhat different support.No matter whether α 0 is denser than β 0 as in Scenario IV, or β 0 is denser than α 0 as in Scenario V, the PASS method consistently outperforms the supervised estimators.Additionally, the performances of the SS prior and pLASSO approaches are not quite satisfactory.In Scenario V, β 0 and α 0 have the same support but are quite different in terms of magnitude.The proposed method managed to utilize the same-support information, whereas the pLASSO approaches failed to do so.Finally, the goal of Scenario VI is to examine the robustness of the methods when β 0 and α 0 differs a lot, possibly due to the use of an inappropriate surrogate.The PASS estimator performs similarly to the supervised estimators, indicating that our procedure is indeed adaptive to how well the data support the prior assumption.Across all scenarios, the ALASSO approach performs slightly worse than LASSO, possibly due to the presence of some small nonzero coefficients in β 0 .
In Figures 3, we present the AUC, ER and MSE-P of the PASS estimator trained with n = 100 and the supervised LASSO estimator with varying label size.In Scenario I where the prior assumption holds exactly, PASS 100 , the PASS approach with 100 labeled samples, even outperforms LASSO 400 , the LASSO approach with 400 labeled samples.When the prior assumption holds approximately as in Scenarios II through V, PASS 100 consistently outperforms LASSO 150 , and Mean performance of the PASS approach are marked using red dash lines for ease of comparison.The size of the labeled data is n = 100 for PASS, while it varies for LASSO, as indicated in the subscripts.
achieves similar performance as LASSO 200 , which requires twice as many labels.Finally, in Scenarios VI where the prior information is highly inaccurate, the PASS method maintains comparable performance against LASSO 100 .

Efficiency and Robustness Evaluations under Mis-specifications
We conducted simulation studies under three additional scenarios to further investigate the efficiency and robustness of PASS when the model assumptions and elliptical design assumptions are violated.We again set p = 500 and generate X i = 2Φ(Z i ) − 1, where Φ(•) is the cumulative distribution function of the standard normal, Z i = (Z i1 , . . ., Z ip ) ∼ N (0, Σ Z ), Σ Z = (σ i,j ) p i,j=1 , σ i,j = (0.5) |i−j| if i = j or both i and j are ≤ 20 or both i and j are > 20 and σ i,j = 0 otherwise.We make Σ Z block-diagonal for the convenience of obtaining the population solution of β and α through the best logistic or least square approximation under model mis-specifications.In real EHR studies, a paradigm of data generation is that the features X, e.g.some genetic variants, precedes the disease status Y , and Y precedes some clinical surrogate S, e.g. the count of ICD codes associated with the disease.To mimic this scenario, we generated Y i and S i from the following models: Assumptions (C prior ) and (M S ) hold when η 1 = η 2 = 0, and would be severely violated when η 1 and η 2 are large.We design three scenarios with η 1 and η 2 representing different degrees of violation on the surrogate assumptions: i: µ = 1, and η 1 = η 2 = 0; ii: µ = 1.5, η 1 = (a 3 , 0 p−5 ) , and η 2 = (d 3 , 0 p−5 ) ; iii: µ = 2, η 1 = (a 3 , a 3 , a 3 , 0 p−15 ) , and where a 3 = (0.6, −0.4,0.4, 0.5, −0.5) and d 3 = (0.3, 0.4, 0.6, −0.5, −0.5) .Here µ depict the marginal effect of Y i on S i , and are set to make the AUC of target models at a similar level across the three scenarios.Across all scenarios, P (Y i = 1 | S i , X i ) is no longer a parametric logistic model, i.e. (M Y ) is misspecified.Our goal is to estimate the limiting coefficients ζ 0 , γ 0 , β 0 defined as the minimizor of E (Y i , ζ +γS i +X i β).Benchmark methods, and their implementation, tuning, and evaluation procedures are the same as in Section 4.1, except that we implement supervised LASSO with n ranging from 100 to 700.
In Figure 4, we present AUC, ER and MSE-P of the methods under Scenarios i-iii.In Scenario i, PASS has similar performances as the semi-supervised benchmarks SS prior and pLASSO and all the semi-supervised estimators significantly outperform the two supervised estimators since (C prior ) holds and α * basically recover the direction of β 0 well.Among the semi-supervised estimators, the SS prior and pLASSO 2 estimators have a slight advantage with smaller variation as expected since both heavily rely on the prior information which is of high quality in this setting.In Scenario ii, the key assumption (C prior ) is violated, which drastically impacts the performance of SS prior and pLASSO 2 .On the other hand, PASS and pLASSO 1 still effectively leverage the imperfect information from α * to approximately recover the support of β 0 , and thus outperform SS prior , pLASSO 2 , and the supervised methods.In Scenario iii, η 1 and η 2 become denser than those in Scenario ii.This can make the recovery of supp(β 0 ) using supp(α * ) less accurate, and interestingly, PASS outperforms all methods including pLASSO 1 that also leverages supp(α * ).In all the three scenarios, Figure 4: Evaluation metrics on the test set for simulation studies under Scenarios i-iii introduced in Section 4.2.Outliers are not drawn.Mean performance of the PASS approach are marked using red dash lines for ease of comparison.On the left panel, we present the evaluation metrics of all methods for comparison when n = 100.On the right panel, we compare the performance of PASS when n = 100 with supervised LASSO obtained using labelled samples with various n (from 100 to 700).PASS significantly outperforms supervised LASSO using the same or even 2-3 more times of sample sizes, which displays a large gain of using the unlabelled data to assist the regression.Finally, the results demonstrate that our method can still efficiently leverage the prior information from S in estimating the target parameters when S | Y, X highly depends on X so C prior is violated, (M Y ) is misspecified, and the design is non-elliptical.

Application to EHR Phenotying
We examine the performance of PASS along with other approaches in three real world EHR phenotyping studies with the goal of developing classification models for the diseases of interest.All studies are performed at a large tertiary hospital system with EHR data spanning over multiple decades.Each study has n 0 labeled observations for algorithm training and validation.We consider three choices of training size n no more than n 0 /2 in all examples.First, we randomly split the labelled samples into four folds of equal sizes.Then we pick each fold as the validation set, sample n training labels from the other three folds for 20 times, train and validate the algorithms, and finally average the evaluation metrics and their standard errors over the validation results on the four folds.We replicate this procedure for 10 times and report the average performance.
Data Example 1 (CAD Phenotyping).The goal of this study is to identify patients with coronary artery disease (CAD) based on their EHR features.The study cohort consists of N = 4164 patients, out of which a random subset of n 0 = 181 patients have their true CAD status annotated via chart review by domain experts.We use the sum of the counts for the CAD ICD code and NLP mention of CAD as the surrogate.There are p = 585 additional EHR features consisting of the total count of all ICD codes as a healthcare utilization measure, 10 ICD codes related to CAD, and 574 NLP variables.For the size of training labels, we consider n = 50, 70, 90.This de-identified dataset has been analyzed in previous studies (Zhang et al., 2019, e.g.) and is publicly available online: https://celehs.github.io/PheCAP/articles/example2.html.
Data Example 2 (RA Phenotyping).Similar to the CAD phenotyping study, the goal is to identify patients with rheumatoid arthritis (RA) based on their EHR features.There are N = 46114 patients in total and out of which, n 0 = 435 patients have their RA status annotated.Again, we choose the sum of the ICD code and NLP mention of RA as the surrogate.The p = 924 additional EHR features consist of the healthcare utilization and 923 NLP variables potentially predictive of RA.
For the size of training labels, we consider n = 50, 125, 200.

Data Example 3 (Depression Phenotyping
).The goal is to identify patients with depression based on their codified EHR features.There are N = 9474 patients in total and n 0 = 236 labeled observations.The surrogate is chosen as the counts of depression ICD code.There are p = 231 additional EHR features, including the healthcare utilization and 230 codified EHR features on depression related medication prescriptions, laboratory tests and ICD codes.For the size of training labels, we consider 50, 85, 120.
In the three data examples, N is significantly larger than p with N/ max(p, n) being approximately 7 for CAD, 50 for RA, and 41 for Depression.In all the three studies, we apply x → log(1+x) transformation for all count variables.Also, since patients with higher healthcare utilization tend to have higher counts of most features, we orthogonalize all features against the healthcare utilization before regression fitting.Since ϑ 0 is unknown in applications, we quantify the performance of an estimator ϑ based on the AUC and Brier skill score (BSS) of π( ϑ W) for predicting Y , where the BSS is defined as 1 , and E v denotes the empirical expectation on the validation sample.The BSS is essentially a binary version of the R-square.
For comparison, we included PASS, SS prior , pLASSO 2 , supervised LASSO and ALASSO on the three data examples to estimate the phenotyping model (M Y ).We exclude pLASSO 1 since it requires fitting of an unpenalized regression on supp( α), which is infeasible when | supp( α)| > n.
In addition, we compare to the unsupervised LASSO (ULASSO) approach of Chakrabortty et al. (2017), which estimates direction of the logistic coefficients for Y ∼ π(β X) by regressing I(S > c u ) against X on the subset whose S is either greater than c u or smaller than c l , for some pre-specified c u and c l typically chosen such that P (S > c u ) and P (S < c l ) are small.Since the ULASSO approach only provides an estimate β to optimize the prediction of β X for Y | X without using S explicitly as an additional predictor, we also derive a semi-supervised variant of ULASSO, denoted by SS ULASSO , by regressing the labeled Y against β X and S as for SS prior .
As shown in Figure 5, PASS significantly outperforms the supervised LASSO and ALASSO when n = 50 on all the three examples.As the label size n increases, their performances get closer.Compared with the semi-supervised benchmarks, PASS has slightly or moderately better performance on the CAD and RA studies.For Depression, PASS substantially outperforms them, especially SS prior and SS ULASSO .For example, when n = 50, PASS attained average AUC in classifying depression about 0.1 higher than that of SS prior and SS ULASSO and 0.05 higher than pLASSO.The gap becomes smaller when n increases as expected.Interestingly, the supervised estimators outperform pLASSO, SS prior , and SS ULASSO on the Depression data as well but has similar or worse performance than these semi-supervised approaches on the other two examples.This could in part be attributed to the relatively poor quality of the surrogate information, which makes existing semisupervised approaches fail.In contrast, PASS could utilize such prior information more effectively and robustly, and still preserves better performance than the supervised estimators.Thus, we can conclude that incorporating prior information from the unlabeled data could improve and stabilize the prediction performance of phenotyping models in EHR applications, and PASS is more robust and efficient in leveraging the prior information compared with existing semi-supervised methods.In addition, ULASSO shows much worse performance than the other supervised and semi-supervised methods on all examples.This illustrates the importance of collecting labels and including the surrogate in the regression models for EHR phenotyping.

Discussion
In this paper, we propose PASS, a high dimensional sparse estimator adaptively incorporating the prior knowledge from surrogate under a semi-supervised scenario commonly found in application fields like EHR analysis.Compared to the supervised approaches, the proposed PASS approach can substantially reduce the required number of labeled samples when the model assumptions (M S ) and (C prior ) and the elliptical design assumption (C1) hold exactly or approximately, and thus the prior information α * is trustworthy.Compared to existing pLASSO and SS prior approaches that also incorporates prior information, the PASS approach is robust against unreliable prior information α * , which might be the case when the surrogate model assumptions are violated or the design X is highly non-elliptically distributed.
One of the main challenges in our theoretical analysis comes from the colinearity of covariates (1, S i , X i α, X i ) due to the enrollment of ρ to leverage the prior information in α.We overcome this by properly constructing the oracle coefficients θ * and the restricted eigenvalue assumption (A6).The formulation of our problem falls into the missing data framework with missing completely at random.However, the missing probability approaches 1 as N → ∞.This together with the high   Method BSS dimensionality of X makes the theoretical justifications more challenging than those used in the standard missing data literature.Without prior assumptions of β 0 − ρα 0 being sparse in certain sense, the unlabeled data cannot directly contribute to the estimation of β 0 .Our proposed PASS procedure hinges on the sparsity of β 0 − ρα 0 to leverage the unlabeled data.
We have restricted the discussion to a single surrogate variable for simplicity.However, the proposed method can be easily extended to multiple surrogates.Specifically, consider K surrogates, denoted by S [1] , . . ., S [K] .Let α [k] be the ALASSO estimator regressing S [k]  i against X i , A = ∪ K k=1 supp( α k ), S i = (S [1]  i , . . ., S [K]  i ) and ρ = (ρ [1] , . . ., ρ [K] ) .We can obtain an estimator for the model parameters as Theoretical justification and finite sample performance of β under this setting warrant further research.In addition, it may be interesting to extend the semi-supervised PASS estimator under a high dimensional sparse parametric regression to semi-parametric settings such as sparse additive model (Ravikumar et al., 2009) and sparse varying coefficient model (Noh and Park, 2010).Under semi-parametric models, one could still leverage prior information through shrinking the coefficients to "ρ α" with some sparse penalty function, to gain statistical efficiency.Studying the specific forms and theoretical properties of such approaches via a semi-supervised framework warrants future research.
R codes for implementing PASS and the benchmark methods, and replicating the simulation results can be found at https://github.com/moleibobliu/PASS.

Figure 1 :
Figure 1: Sketch diagrams of the coefficients β 0 and α * in the three cases of Section 3.3.Labels S + , S − , and A * in the diagrams represent S + \ P, S -, and A * as chosen and defined in Cases 1-3.β 0 and α * are aligned for comparison of their directions and supports.Symbols , . .., , and represent zero, ellipsis of zero coefficients, a unit of magnitude in β 0 , and a unit of magnitude in α * respectively.Case 1 (presented in the left panel): α * recovers both the support and direction of β 0 .Case 2 (middle): α * (nearly) recovers the support but not the direction β 0 .Case 3 (right): α * fails to recover the support of β 0 .

Figure 2 :Figure 3 :
Figure2: AUC (left), ER (middle) and MSE-P (right) evaluated on the test set for simulation studies under Scenarios I-VI.Outliers are not drawn.Mean performance of the PASS approach are marked using dashed lines for ease of comparison.The size of the labeled data is fixed at n = 100.

Figure 5 :
Figure 5: Out of sample AUC and BSS on the data examples 1-3, with various sizes of labelled training samples denoted as n.Median performance of PASS are marked using red dash line for ease of comparison.